Julian Heinrich, VISUS - Uni Stuttgart, Julian.Heinrich@vis.uni-stuttgart.de
Christoph Mueller, VISUS - Uni Stuttgart, Christoph.Mueller@vis.uni-stuttgart.de
Guido Reina, VISUS - Uni Stuttgart, Guido.Reina@vis.uni-stuttgart.de
SpRay was developed during the master's thesis of Julian Heinrich at the Eberhard-Karls-Universität Tübingen and was originally targeted at the visual exploration of gene expression data. SpRay is a generic visual analytics tool that tightly integrates interactive visualization with the statistical programming language R.
See Linq was developed during the VAST ’09 contest by Guido Reina and Christoph Müller at the Visualization Institute of Universität Stuttgart (VISUS). It is based on a queryable data model and employs .NET mechanisms to integrate interactively formulated queries as data sources for linking and brushing. The visualization was developed using rapid prototyping and supports time-based events; even though the glyphs are customized for this contest, the visualization can easily be adapted to other tasks.
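As a rough illustration of this queryable model, the following minimal sketch shows how an interactively formulated LINQ query could act as a brushing source for linked views; the record, class, and method names are illustrative and do not reflect See Linq's actual API.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical event record; field names are illustrative only.
    record NetworkEvent(DateTime Time, string SourceIp, string DestIp,
                        int Port, long RequestSize, long ResponseSize);

    static class BrushingSource
    {
        // A brush is simply a query over the shared event collection;
        // every linked view re-evaluates it to highlight the matching records.
        public static IEnumerable<NetworkEvent> Brush(
            IEnumerable<NetworkEvent> events,
            Func<NetworkEvent, bool> predicate) => events.Where(predicate);
    }

    // Example: brush all mail traffic (port 25) in every linked view.
    // var mail = BrushingSource.Brush(allEvents, e => e.Port == 25);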
Video:
ANSWERS:
MC1.1:
Identify which computer(s) the employee most likely used to send information to
his contact in a tab-delimited table which contains for each computer
identified: when the information was sent, how much information was sent and
where that information was sent.
MC1.2:
Characterize the patterns of behavior of suspicious computer use.
To identify suspicious computer use, we first tried to identify irregular network traffic and then related it to the badge log to single out the potential mole.

Using SpRay, we visualized the source/destination IP connection matrix as tables and parallel coordinates. The visualization reveals one connection-count outlier, 37.170.30.250, and one count that is extremely regular: 37.170.100.200 (Figure 1). Inspecting the traffic to the former in a table shows that only port 25 is used. Selecting this port in the parallel-coordinates plot confirms that indeed all mail traffic is directed at 37.170.30.250. We hypothesized that data theft via mail is too risky (mail is usually logged) and thus excluded all mail traffic. Traffic to 37.170.100.200 is caused equally by all employee machines, and its count in the linked table approximately matches the number of working days in a month (20/21), so it is probably not suspicious. Using the total upload size in the connection matrix reveals another outlier in the parallel coordinates: the top 13 uploads go to 100.59.151.133, which we therefore classify as suspicious.
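A minimal sketch of the aggregation behind this connection matrix, assuming the IP log has been parsed into records (the record layout and the loader are illustrative, not the actual tool code):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // The record layout mirrors the fields of the contest's IP log; the names
    // and the loader are our own.
    record IpRecord(DateTime Time, string SourceIp, string DestIp,
                    int Port, long RequestSize, long ResponseSize);

    static class ConnectionMatrix
    {
        static void Main()
        {
            IEnumerable<IpRecord> log = LoadIpLog();   // parsing not shown

            var perDestination = log
                .Where(r => r.Port != 25)              // exclude mail traffic as argued above
                .GroupBy(r => r.DestIp)
                .Select(g => new
                {
                    DestIp = g.Key,
                    Connections = g.Count(),
                    TotalUpload = g.Sum(r => r.RequestSize)
                })
                .OrderByDescending(x => x.TotalUpload);

            foreach (var row in perDestination.Take(15))
                Console.WriteLine($"{row.DestIp}\t{row.Connections}\t{row.TotalUpload}");
        }

        static IEnumerable<IpRecord> LoadIpLog() => Array.Empty<IpRecord>(); // stub
    }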
Figure 1: Parallel-coordinates plot linked with the R backend, displaying the connection count per destination address
To find a relation to the badge traffic, we devised a visualization that presents IP traffic in context with badge events (see Figure 2). Since our data model checks for basic consistency, such as a strict alternation of prox-in-classified and prox-out-classified events, we found badges 38 and 49 as well as 30 to be inconsistent. The former two have multiple presences in the classified room (missing prox-out-classified events), the latter a negative presence (missing prox-in-classified events). We adjusted this programmatically by inserting missing prox-out-classified events just before the next prox-in-building and missing prox-in-classified events just after the previous prox-in-building. These virtual events are visualized in red so that any ensuing traffic flagged as suspicious can be recognized as potentially invalid.
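A sketch of this repair step for a single employee's badge stream is given below; the event vocabulary follows the contest data, while the type names and the one-second offsets for the virtual events are simplifications of ours.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    enum Kind { ProxInBuilding, ProxInClassified, ProxOutClassified }

    record BadgeEvent(DateTime Time, int Employee, Kind Kind, bool Virtual = false);

    static class BadgeRepair
    {
        // Repairs one employee's badge stream: a missing prox-out-classified is
        // inserted just before the next prox-in-building, a missing
        // prox-in-classified just after the previous prox-in-building.
        public static List<BadgeEvent> Repair(IEnumerable<BadgeEvent> stream)
        {
            var repaired = new List<BadgeEvent>();
            bool inClassified = false;
            DateTime lastBuildingEntry = default;

            foreach (var e in stream.OrderBy(ev => ev.Time))
            {
                switch (e.Kind)
                {
                    case Kind.ProxInBuilding:
                        if (inClassified)   // prox-out-classified was never logged
                        {
                            repaired.Add(new BadgeEvent(e.Time.AddSeconds(-1),
                                e.Employee, Kind.ProxOutClassified, Virtual: true));
                            inClassified = false;
                        }
                        lastBuildingEntry = e.Time;
                        break;
                    case Kind.ProxInClassified:
                        inClassified = true;
                        break;
                    case Kind.ProxOutClassified:
                        if (!inClassified)  // prox-in-classified was never logged
                            repaired.Add(new BadgeEvent(lastBuildingEntry.AddSeconds(1),
                                e.Employee, Kind.ProxInClassified, Virtual: true));
                        inClassified = false;
                        break;
                }
                repaired.Add(e);
            }
            return repaired.OrderBy(ev => ev.Time).ToList();
        }
    }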
Figure 2: Badge and IP traffic visualized together; the X-axis represents time from left to right, the Y-axis represents employees
Figure 3: Top left: dataset with the unmatched prox-in- and prox-out-classified events caused by, e.g., piggybacking. Bottom right: after programmatic insertion of virtual events, the classified-room presence of employees #30, #38, and #49 can be determined, but false positives for network traffic on machine #30 appear (empty circles).
Highlighting the traffic to 37.170.100.200 makes evident that it is caused either by some kind of login process or by a bulletin system which, if accessed at all, every employee accesses once per day as the first traffic from his machine. We verified this by formulating an exact query against the underlying data model. If no access to this machine occurs, the employee enters the classified room before generating any IP traffic, so the necessary information must be available there as well; we verified these 21 occurrences manually in the visualization. Only three exceptions remain, and they represent uploads to the already suspicious 100.59.151.133.
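The verification query can be sketched as follows: for every machine and day, take the first IP record of that day and check its destination (the record type is again illustrative):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    record IpRecord(DateTime Time, string SourceIp, string DestIp,
                    int Port, long RequestSize, long ResponseSize);

    static class FirstTrafficCheck
    {
        public static IEnumerable<IpRecord> Exceptions(IEnumerable<IpRecord> log) =>
            log.GroupBy(r => (r.SourceIp, r.Time.Date))      // one group per machine and day
               .Select(g => g.OrderBy(r => r.Time).First())  // first record of that day
               .Where(first => first.DestIp != "37.170.100.200");
        // In our data, the exceptions that remain are the uploads to 100.59.151.133.
    }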
Highlighting the remaining 15 uploads to 100.59.151.133, one can see in the visualization that some of them happen while the computer owner is in the classified room. These uploads are also conspicuously isolated. We therefore concluded that the computer owner probably did not trigger them and supposed that the mole, in order to obfuscate his actions, does not necessarily use his own computer to upload the stolen data. We then wanted to find out which employee has enough time to trigger these suspicious uploads. We defined a variable time window before and after each upload during which the suspect is not allowed to be in the classified room. Applying this filter programmatically with a window of 2 minutes before and after each upload, only employees #27 and #30 remain. We manually examined the uploads for both suspects and found that on 01/22, #27 entered the building only after the upload and did not exhibit any other network traffic before badging in. Therefore, #27 is not the suspect.
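A sketch of this exclusion filter, assuming classified-room stays have already been derived from the repaired badge stream (types and names are again illustrative):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    record ClassifiedStay(int Employee, DateTime Enter, DateTime Leave);

    static class SuspectFilter
    {
        // An employee could have triggered an upload only if none of his
        // classified-room stays overlaps the window [upload - slack, upload + slack].
        public static bool CouldHaveTriggered(int employee, DateTime upload,
            IEnumerable<ClassifiedStay> stays, TimeSpan slack) =>
            !stays.Any(s => s.Employee == employee &&
                            s.Enter <= upload + slack &&
                            s.Leave >= upload - slack);

        // Keep only the employees who could have triggered every suspicious upload.
        public static IEnumerable<int> RemainingSuspects(
            IEnumerable<int> employees, IEnumerable<DateTime> uploads,
            IEnumerable<ClassifiedStay> stays, TimeSpan slack) =>
            employees.Where(e => uploads.All(u => CouldHaveTriggered(e, u, stays, slack)));
    }

    // With slack = TimeSpan.FromMinutes(2), only employees #27 and #30 remain.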
Figure 4: Suspicious uploads to 100.59.151.133 along with inconsistent traffic (circled). Inconsistencies for #30 stem from the adjusted data. Only the two potential moles, #27 and #30, are never in the classified room during these uploads. The red box shows a magnification of the incident on 01/22, where the suspicious upload happens before #27 enters the embassy.
For each upload event, we then checked whether the suspicious uploads were conducted without the computer owner or his/her roommate being present. In most cases, the potential disturbers are either inside the classified room or their machines generate no traffic for a significant time, which we interpreted as absence. When #30 uses his neighbor's machine (37.170.100.31), his own machine shows traffic regardless, which supports our suspicion. On 01/24 there are two very risky uploads, with very little time (10 minutes and 3 minutes) to complete the upload without being surprised by the machine owner or his neighbor. This might be a consequence of the mole transmitting more data towards the end of the month: as the amount of data transmitted per day increases by a factor of 3 between 01/08 and 01/31, more uploads from different machines are required to send everything. However, all of the rooms used are across the aisle from his own, and we hypothesized at least a semi-automatic upload process so as to minimize the mole's time at another machine (and thus the risk of contract termination).
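The absence heuristic can be sketched as follows; the idle window and all names are assumptions of ours rather than the exact criterion used:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    record IpRecord(DateTime Time, string SourceIp, string DestIp,
                    int Port, long RequestSize, long ResponseSize);

    static class PresenceHeuristic
    {
        // A machine counts as unattended around 'upload' if it generates no
        // traffic of its own (other than the upload itself) within +/- 'idle'.
        public static bool Unattended(string machineIp, DateTime upload,
            IEnumerable<IpRecord> log, TimeSpan idle) =>
            !log.Any(r => r.SourceIp == machineIp &&
                          r.Time > upload - idle && r.Time < upload + idle &&
                          r.DestIp != "100.59.151.133");
    }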
Figure 5: Spatial distribution of the machines used to transmit data. Most of the machines are within easy reach of the desk of employee #30.
Our conclusion is that employee #30 is the mole; he uploaded classified information 18 times to 100.59.151.133 using 12 different machines.
We derive the following patterns from our analysis:
- Mail traffic is too easy to track and too risky for leaking confidential data.
- Data theft is characterized by a large RequestSize in the IP traffic.
- Information is always sent to the same destination.
- Information is always sent on Tuesdays and Thursdays.
- The mole never uses his own machine for the uploads.
- The mole can only hijack machines when he is alone (risk of termination).
We also tried alternative approaches to find additional suspicious uploads with programmatic queries:
- The obvious pattern of an employee leaving the classified room and generating an upload within a certain time t. Setting t to 10 minutes revealed no significant clusters of sources or destinations; the largest uploads are either isolated cases or go to machines that are commonly accessed from at least half of the employee computers.
- We also searched by the request/response ratio instead of the absolute request size (see the sketch after this list). In this case, too, the top ten connections consisted only of commonly accessed machines and the already known 100.59.151.133, so this approach merely confirmed our primary suspect.
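A sketch of the ratio-based ranking, with the grouping key and record layout again being illustrative assumptions:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    record IpRecord(DateTime Time, string SourceIp, string DestIp,
                    int Port, long RequestSize, long ResponseSize);

    static class RatioRanking
    {
        // Rank source/destination pairs by their request/response ratio rather
        // than by absolute request size.
        public static IEnumerable<(string Source, string Dest, double Ratio)> TopTen(
            IEnumerable<IpRecord> log) =>
            log.GroupBy(r => (r.SourceIp, r.DestIp))
               .Select(g => (Source: g.Key.SourceIp, Dest: g.Key.DestIp,
                             Ratio: (double)g.Sum(r => r.RequestSize) /
                                    Math.Max(1L, g.Sum(r => r.ResponseSize))))
               .OrderByDescending(x => x.Ratio)
               .Take(10);
    }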